Kaggle의 오픈 데이터에서 강의용 데이터로 쓸만한 회귀분석용 데이터를 찾던 중 이 데이터를 찾게 되었다. 링크
이 데이터셋은 2014년 5월부터 2015년 5월까지 매매된 King County의 집값 데이터를 포함하고 있다.
이 데이터를 활용해서 강의에서 전달하고자 하는 바는 Feature Engineering과 Data Exploration의 힘이다. 정말 간단한 몇 개의 과정을 반복하는 것만으로도 예측 모델의 성능을 어디까지 끌어올릴 수 있는지를 보여주고자 했다.
강의를 위해서 준비한 자료를 공유하는 것이기 때문에, 자세한 코멘트는 달지 않았다. 추후 시간을 더 들여서 자세한 설명을 담은 포스트를 작성하고자 한다.
### Load the libraries
library(lubridate)
Attaching package: 'lubridate'
The following object is masked from 'package:base':
date
library(readr)
library(dplyr)
library(ggplot2)
library(GGally)
Attaching package: 'GGally'
The following object is masked from 'package:dplyr':
nasa
library(corrplot)
library(ggmap)
### Load the dataset
House <- read_csv("data/king_county/kc_house_data.csv")
Parsed with column specification:
cols(
.default = col_integer(),
id = col_character(),
date = col_datetime(format = ""),
price = col_double(),
bathrooms = col_double(),
floors = col_double(),
lat = col_double(),
long = col_double()
)
See spec(...) for full column specifications.
head(House)
| id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7129300520 | 2014-10-13 | 221900 | 3 | 1.00 | 1180 | 5650 | 1 | 0 | 0 | 3 | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 6414100192 | 2014-12-09 | 538000 | 3 | 2.25 | 2570 | 7242 | 2 | 0 | 0 | 3 | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.7210 | -122.319 | 1690 | 7639 |
| 5631500400 | 2015-02-25 | 180000 | 2 | 1.00 | 770 | 10000 | 1 | 0 | 0 | 3 | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 2487200875 | 2014-12-09 | 604000 | 4 | 3.00 | 1960 | 5000 | 1 | 0 | 0 | 5 | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 1954400510 | 2015-02-18 | 510000 | 3 | 2.00 | 1680 | 8080 | 1 | 0 | 0 | 3 | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |
| 7237550310 | 2014-05-12 | 1225000 | 4 | 4.50 | 5420 | 101930 | 1 | 0 | 0 | 3 | 11 | 3890 | 1530 | 2001 | 0 | 98053 | 47.6561 | -122.005 | 4760 | 101930 |
### Initialize a map for King County
kingCounty <- get_map(location = 'issaquah',
zoom = 9,
maptype = "roadmap"
)
Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=issaquah&zoom=9&size=640x640&scale=2&maptype=roadmap&language=en-EN&sensor=false
Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=issaquah&sensor=false
### Generate a heat map
ggmap(kingCounty) +
geom_density2d(data = House, aes(x = long, y = lat), size = .3) +
stat_density2d(data = House, aes(x = long, y = lat, fill = ..level.., alpha = ..level..), size = 0.01, bins = 16, geom = "polygon") +
scale_fill_gradient(low = "green", high = "red") +
scale_alpha(range = c(0.2, 0.4), guide = FALSE)
House %>%
mutate(
sale_year = year(date),
sale_month = month(date)
) %>%
select(-id, -date) -> House
set.seed(2017)
trainIdx <- sample(1:nrow(House), size = 0.7 * nrow(House))
train <- House[trainIdx, ]
test <- House[-trainIdx, ]
bench_model <- lm(price ~ ., data = train)
summary(bench_model)
Call:
lm(formula = price ~ ., data = train)
Residuals:
Min 1Q Median 3Q Max
-1289272 -98433 -9562 76172 4400147
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.428e+07 1.179e+07 -6.301 3.03e-10 ***
bedrooms -3.369e+04 2.203e+03 -15.296 < 2e-16 ***
bathrooms 4.165e+04 3.863e+03 10.782 < 2e-16 ***
sqft_living 1.444e+02 5.198e+00 27.778 < 2e-16 ***
sqft_lot 1.202e-01 5.446e-02 2.207 0.027360 *
floors 5.802e+03 4.248e+03 1.366 0.172069
waterfront 5.701e+05 2.039e+04 27.952 < 2e-16 ***
view 5.602e+04 2.482e+03 22.572 < 2e-16 ***
condition 2.925e+04 2.802e+03 10.439 < 2e-16 ***
grade 9.494e+04 2.551e+03 37.217 < 2e-16 ***
sqft_above 2.731e+01 5.154e+00 5.299 1.18e-07 ***
sqft_basement NA NA NA NA
yr_built -2.489e+03 8.579e+01 -29.018 < 2e-16 ***
yr_renovated 2.151e+01 4.304e+00 4.997 5.89e-07 ***
zipcode -5.576e+02 3.909e+01 -14.265 < 2e-16 ***
lat 6.133e+05 1.268e+04 48.360 < 2e-16 ***
long -2.092e+05 1.559e+04 -13.422 < 2e-16 ***
sqft_living15 2.897e+01 4.061e+00 7.134 1.02e-12 ***
sqft_lot15 -2.804e-01 8.390e-02 -3.342 0.000833 ***
sale_year 3.893e+04 5.581e+03 6.976 3.16e-12 ***
sale_month 1.914e+03 8.338e+02 2.296 0.021713 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 198800 on 15109 degrees of freedom
Multiple R-squared: 0.7017, Adjusted R-squared: 0.7013
F-statistic: 1870 on 19 and 15109 DF, p-value: < 2.2e-16
benchmark <- predict(bench_model, test)
benchmark <- ifelse(benchmark < 0, 0, benchmark)
### Generate a heat map
ggmap(kingCounty) +
geom_point(data = train, aes(x = long, y = lat, color = log(price), alpha = log(price))) +
scale_color_gradient(low = "green", high = "red")
price를 로그화하여 시각화한 결과로, 남부(lat < 47.5)보다 북부(lat >= 47.5) 쪽의 가격이 더 높음을 알 수 있다.cor_House <- cor(House[, -1])
corrplot(cor_House, order = "hclust")
train %>%
mutate(grade = factor(grade)) %>%
ggplot(aes(x = grade, y = price, fill = grade)) +
geom_boxplot() +
geom_point(
data = train %>%
group_by(grade) %>%
summarise(median = median(price)) %>%
mutate(grade = factor(grade)),
aes(x = grade, y = median, group = 1),
size = 5, stroke = 2,
color = "black", fill = "white", shape = 23
)
grade가 한 단계 높아질 때마다 가격이 기하급수적으로 증가하는 것으로 보인다. 확인을 위해서 log(price)에 대해서 박스플롯을 그려본다.train %>%
mutate(grade = factor(grade)) %>%
ggplot(aes(x = grade, y = log(price), fill = grade)) +
geom_boxplot() +
geom_point(
data = train %>%
group_by(grade) %>%
summarise(median = median(log(price))) %>%
mutate(grade = factor(grade)),
aes(x = grade, y = median, group = 1),
size = 5, stroke = 2,
color = "black", fill = "white", shape = 23
)
train %>%
mutate(yr_cat = cut(yr_built, breaks = seq(1900, 2020, by = 10),
labels = paste0(seq(1900, 2010, by = 10), "s"))) %>%
ggplot(aes(x = yr_cat, y = log(price), fill = yr_cat)) +
geom_boxplot()
train %>%
filter(yr_renovated != 0) %>%
mutate(renovated_cat = cut(yr_renovated, breaks = seq(1930, 2020, by = 10),
labels = paste0(seq(1930, 2010, by = 10), "s"))) %>%
ggplot(aes(x = renovated_cat, y = log(price), fill = renovated_cat)) +
geom_boxplot()
train %>%
mutate(isRenovated = factor(ifelse(yr_renovated != 0, 1, 0))) %>%
ggplot(aes(x = isRenovated, y = log(price), fill = isRenovated)) +
geom_boxplot()
train %>%
mutate(sale_year = factor(sale_year)) %>%
ggplot(aes(x = sale_year, y = log(price), fill = sale_year)) +
geom_boxplot()
train %>%
mutate(sale_month = factor(sale_month)) %>%
ggplot(aes(x = sale_month, y = log(price), fill = sale_month)) +
geom_boxplot()
train %>%
mutate(bathrooms = factor(bathrooms)) %>%
ggplot(aes(x = bathrooms, y = log(price), fill = bathrooms)) +
geom_boxplot()
log(price)와 bathrooms는 유사 선형관계를 가진다.train %>%
ggplot(aes(x = lat, y = log(price), color = lat)) +
geom_line() + geom_point(shape = 21)
train %>%
ggplot(aes(x = long, y = log(price), color = long)) +
geom_line() + geom_point(shape = 21)
sort(unique(train$zipcode)) == sort(unique(test$zipcode))
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[57] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
train %>%
arrange(zipcode) %>%
mutate(zipcode = factor(zipcode)) %>%
ggplot(aes(x = zipcode, y = log(price), fill = zipcode)) +
geom_boxplot()
splitLat <- function(data){
data <- data %>%
dplyr::mutate(lat1 = ifelse(lat <= 47.5, lat, 0),
lat2 = ifelse(lat > 47.5 & lat <= 47.6, lat, 0),
lat3 = ifelse(lat > 47.6, lat, 0)) %>%
dplyr::select(-lat)
return(data)
}
train <- splitLat(train)
test <- splitLat(test)
train <- train %>%
mutate(isRenovated = ifelse(yr_renovated != 0, 1, 0))
test <- test %>%
mutate(isRenovated = ifelse(yr_renovated != 0, 1, 0))
train <- train %>%
mutate(age = ifelse(yr_renovated != 0, 2016 - yr_renovated, 2016 - yr_built))
test <- test %>%
mutate(age = ifelse(yr_renovated != 0, 2016 - yr_renovated, 2016 - yr_built))
train$zipcode <- factor(train$zipcode)
test$zipcode <- factor(test$zipcode)
zipcode_train <- data.frame(model.matrix(price ~ 0 + zipcode, data = train))
zipcode_test <- data.frame(model.matrix(price ~ 0 + zipcode, data = test))
train <- train %>%
select(-zipcode) %>%
cbind(zipcode_train)
test <- test %>%
select(-zipcode) %>%
cbind(zipcode_test)
model <- lm(log(price) ~ . - age, data = train)
summary(model)
Call:
lm(formula = log(price) ~ . - age, data = train)
Residuals:
Min 1Q Median 3Q Max
-1.34416 -0.09529 0.00901 0.10456 1.00032
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.714e+02 1.346e+01 -12.730 < 2e-16 ***
bedrooms 3.479e-03 2.059e-03 1.689 0.091173 .
bathrooms 3.477e-02 3.599e-03 9.661 < 2e-16 ***
sqft_living 1.293e-04 4.842e-06 26.704 < 2e-16 ***
sqft_lot 5.834e-07 5.052e-08 11.547 < 2e-16 ***
floors -2.649e-02 4.293e-03 -6.170 6.99e-10 ***
waterfront 4.806e-01 1.919e-02 25.050 < 2e-16 ***
view 5.914e-02 2.350e-03 25.163 < 2e-16 ***
condition 6.474e-02 2.656e-03 24.372 < 2e-16 ***
grade 8.779e-02 2.486e-03 35.317 < 2e-16 ***
sqft_above 7.064e-05 4.956e-06 14.255 < 2e-16 ***
sqft_basement NA NA NA NA
yr_built -1.922e-04 8.836e-05 -2.175 0.029630 *
yr_renovated 3.774e-03 4.898e-04 7.706 1.38e-14 ***
long -2.737e-01 6.348e-02 -4.312 1.63e-05 ***
sqft_living15 8.924e-05 3.933e-06 22.692 < 2e-16 ***
sqft_lot15 2.001e-07 7.989e-08 2.505 0.012267 *
sale_year 6.783e-02 5.156e-03 13.157 < 2e-16 ***
sale_month 3.005e-03 7.699e-04 3.904 9.52e-05 ***
lat1 2.803e-01 9.154e-02 3.062 0.002203 **
lat2 2.827e-01 9.148e-02 3.090 0.002005 **
lat3 2.830e-01 9.142e-02 3.096 0.001967 **
isRenovated -7.437e+00 9.778e-01 -7.606 2.99e-14 ***
zipcode98001 -6.005e-01 3.569e-02 -16.826 < 2e-16 ***
zipcode98002 -6.169e-01 3.805e-02 -16.212 < 2e-16 ***
zipcode98003 -5.981e-01 3.546e-02 -16.869 < 2e-16 ***
zipcode98004 2.899e-01 2.205e-02 13.144 < 2e-16 ***
zipcode98005 -4.905e-02 2.638e-02 -1.859 0.062979 .
zipcode98006 -1.368e-01 2.759e-02 -4.957 7.23e-07 ***
zipcode98007 -1.256e-01 2.798e-02 -4.491 7.14e-06 ***
zipcode98008 -1.272e-01 2.610e-02 -4.872 1.11e-06 ***
zipcode98010 -2.845e-01 4.438e-02 -6.410 1.50e-10 ***
zipcode98011 -3.850e-01 2.631e-02 -14.635 < 2e-16 ***
zipcode98014 -3.793e-01 4.221e-02 -8.985 < 2e-16 ***
zipcode98019 -4.399e-01 3.589e-02 -12.257 < 2e-16 ***
zipcode98022 -4.542e-01 4.833e-02 -9.398 < 2e-16 ***
zipcode98023 -6.624e-01 3.506e-02 -18.894 < 2e-16 ***
zipcode98024 -2.737e-01 4.448e-02 -6.154 7.75e-10 ***
zipcode98027 -1.882e-01 3.233e-02 -5.819 6.03e-09 ***
zipcode98028 -4.215e-01 2.326e-02 -18.120 < 2e-16 ***
zipcode98029 -1.143e-01 3.417e-02 -3.345 0.000825 ***
zipcode98030 -5.425e-01 3.469e-02 -15.638 < 2e-16 ***
zipcode98031 -5.334e-01 3.254e-02 -16.393 < 2e-16 ***
zipcode98032 -6.521e-01 3.605e-02 -18.087 < 2e-16 ***
zipcode98033 -3.732e-02 2.174e-02 -1.717 0.086068 .
zipcode98034 -2.916e-01 2.130e-02 -13.687 < 2e-16 ***
zipcode98038 -3.838e-01 3.739e-02 -10.265 < 2e-16 ***
zipcode98039 4.156e-01 3.461e-02 12.009 < 2e-16 ***
zipcode98040 7.704e-02 2.650e-02 2.907 0.003657 **
zipcode98042 -5.180e-01 3.494e-02 -14.824 < 2e-16 ***
zipcode98045 -1.803e-01 4.814e-02 -3.745 0.000181 ***
zipcode98052 -1.571e-01 2.412e-02 -6.513 7.62e-11 ***
zipcode98053 -1.793e-01 2.955e-02 -6.067 1.34e-09 ***
zipcode98055 -4.773e-01 3.013e-02 -15.839 < 2e-16 ***
zipcode98056 -4.082e-01 2.752e-02 -14.830 < 2e-16 ***
zipcode98058 -4.547e-01 3.093e-02 -14.702 < 2e-16 ***
zipcode98059 -3.241e-01 2.930e-02 -11.062 < 2e-16 ***
zipcode98065 -2.801e-01 4.154e-02 -6.742 1.62e-11 ***
zipcode98070 -3.817e-01 3.460e-02 -11.033 < 2e-16 ***
zipcode98072 -3.325e-01 2.740e-02 -12.135 < 2e-16 ***
zipcode98074 -2.092e-01 2.808e-02 -7.449 9.93e-14 ***
zipcode98075 -1.881e-01 3.287e-02 -5.723 1.07e-08 ***
zipcode98077 -3.620e-01 3.144e-02 -11.514 < 2e-16 ***
zipcode98092 -5.397e-01 3.801e-02 -14.199 < 2e-16 ***
zipcode98102 1.037e-01 2.557e-02 4.055 5.04e-05 ***
zipcode98103 -4.703e-02 1.643e-02 -2.863 0.004201 **
zipcode98105 8.999e-02 2.070e-02 4.348 1.38e-05 ***
zipcode98106 -4.844e-01 2.412e-02 -20.081 < 2e-16 ***
zipcode98107 -3.260e-02 1.897e-02 -1.718 0.085781 .
zipcode98108 -4.541e-01 2.633e-02 -17.243 < 2e-16 ***
zipcode98109 1.566e-01 2.431e-02 6.442 1.22e-10 ***
zipcode98112 2.175e-01 1.997e-02 10.893 < 2e-16 ***
zipcode98115 -3.645e-02 1.718e-02 -2.122 0.033879 *
zipcode98116 -6.941e-02 2.326e-02 -2.985 0.002844 **
zipcode98117 -6.218e-02 1.637e-02 -3.799 0.000146 ***
zipcode98118 -3.441e-01 2.410e-02 -14.274 < 2e-16 ***
zipcode98119 1.244e-01 2.082e-02 5.975 2.35e-09 ***
zipcode98122 -3.109e-02 1.934e-02 -1.608 0.107939
zipcode98125 -2.822e-01 1.917e-02 -14.719 < 2e-16 ***
zipcode98126 -2.726e-01 2.384e-02 -11.438 < 2e-16 ***
zipcode98133 -4.177e-01 1.852e-02 -22.556 < 2e-16 ***
zipcode98136 -1.398e-01 2.514e-02 -5.559 2.76e-08 ***
zipcode98144 -1.551e-01 2.358e-02 -6.578 4.92e-11 ***
zipcode98146 -4.871e-01 2.546e-02 -19.131 < 2e-16 ***
zipcode98148 -4.666e-01 3.982e-02 -11.717 < 2e-16 ***
zipcode98155 -4.417e-01 2.033e-02 -21.720 < 2e-16 ***
zipcode98166 -3.702e-01 2.870e-02 -12.898 < 2e-16 ***
zipcode98168 -6.160e-01 2.705e-02 -22.774 < 2e-16 ***
zipcode98177 -2.814e-01 2.055e-02 -13.693 < 2e-16 ***
zipcode98178 -5.612e-01 2.746e-02 -20.436 < 2e-16 ***
zipcode98188 -5.367e-01 3.216e-02 -16.688 < 2e-16 ***
zipcode98198 -5.940e-01 3.122e-02 -19.028 < 2e-16 ***
zipcode98199 NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1832 on 15038 degrees of freedom
Multiple R-squared: 0.8798, Adjusted R-squared: 0.8791
F-statistic: 1223 on 90 and 15038 DF, p-value: < 2.2e-16
평가 메트릭으로 Root Mean Squared Logarithmic Error(RMSLE)를 사용한다. 해당 메트릭은 과대평가된 항목보다는 과소평가된 항목에 페널티를 준다.
\[ RMSLE = \sqrt{\frac{1}{n} \sum^n_{i=1} \left( \log(p_i + 1) - \log(a_i + 1)\right)^2} \]
rmsle <- function(predict, actual){
if(length(predict) != length(actual))
stop("The length of two vectors are different.")
len <- length(predict)
rmsle <- sqrt((1/len) * sum((log(predict + 1) - log(actual + 1))^2))
return(rmsle)
}
pred <- predict(model, test)
pred <- exp(pred)
result <- rmsle(pred, test$price)
benchmark_result <- rmsle(benchmark, test$price)
cat("RMSLE (Benchmark): ", benchmark_result, "\nRMSLE (Final): ", result)
RMSLE (Benchmark): 0.9280684
RMSLE (Final): 0.1794783